ARB_gpu_shader_fp64


* New GLSL types: double, dvecX and dmatX
* Variables, inputs, outputs, uniforms, constants (LF suffix)
* No 64-bit vertex attributes (handled by another extension)
* No 64-bit fragment shader outputs (no 64-bit FBs)
* Arithmetic and relational operators supported
* Most built-in functions can take the new types
  * Some exceptions: angle, trigonometry, exponential, noise
* New packing functions: (un)packDouble2x32
* Conversions from/to 32-bit types


* No interpolation 
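
The packing functions listed above can be modeled on the CPU. The following is a minimal C sketch, not driver code: the function names are hypothetical, and it assumes the GLSL convention that the first component holds the 32 least significant bits of the double.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical CPU model of GLSL packDouble2x32(): v[0] supplies the 32
 * least significant bits, v[1] the 32 most significant bits. */
static double pack_double_2x32(const uint32_t v[2])
{
    uint64_t bits = ((uint64_t)v[1] << 32) | v[0];
    double d;
    memcpy(&d, &bits, sizeof d);   /* reinterpret the bits, no conversion */
    return d;
}

/* Hypothetical CPU model of GLSL unpackDouble2x32(): the inverse split. */
static void unpack_double_2x32(double d, uint32_t v[2])
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    v[0] = (uint32_t)bits;          /* low 32 bits */
    v[1] = (uint32_t)(bits >> 32);  /* high 32 bits */
}
```

Going through memcpy and a 64-bit integer keeps the sketch independent of host endianness and avoids type-punning through pointers.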
Scope (i965)


* Combined work by Intel and Igalia for over a year


* ~260 new patches to add NIR (~60) and i965 support (~100
Broadwell+, ~100 Haswell)


* More patches on the way!
* Usable across all shader stages
* 3 IRs to support: NIR, i965/align1, i965/align16


* Lots of GL functionality involved: ALU, varyings, uniforms, UBO, 
SSBO, shared variables, etc. 


* Lots of internal driver modules involved: Optimization passes, 
liveness analysis, register spilling, etc. 
Scope (Piglit)


* Piglit coverage was limited


* Mostly focused on ALU operations.
* ~2,000 new tests added
NIR


* Added support for bit-sized ALU types:
  * nir_type_float32 = 32 | nir_type_float, etc.
  * nir_alu_type_get_type_size(), nir_alu_type_get_base_type()
* We need to be a bit more careful now:


- if (alu_info.output_type == nir_type_bool) {
+ if (nir_alu_type_get_base_type(alu_info.output_type) == nir_type_bool) {
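
To illustrate why the comparison above had to change, here is a minimal C sketch of a type enum that ORs a bit size into a base type. The enum values, masks and function names are assumptions made for the example; the real NIR encoding may differ.

```c
#include <stdbool.h>

/* Hypothetical encoding: the low 3 bits hold the base type, the remaining
 * bits hold the bit size. Real NIR enum values may differ. */
enum alu_type {
    TYPE_FLOAT   = 1,
    TYPE_INT     = 2,
    TYPE_UINT    = 3,
    TYPE_BOOL    = 4,
    TYPE_FLOAT32 = 32 | TYPE_FLOAT,
    TYPE_FLOAT64 = 64 | TYPE_FLOAT,
    TYPE_BOOL32  = 32 | TYPE_BOOL
};

#define TYPE_BASE_MASK 0x7u
#define TYPE_SIZE_MASK (~TYPE_BASE_MASK)

/* Strip the bit size and return only the base type. */
static enum alu_type alu_type_get_base_type(enum alu_type t)
{
    return (enum alu_type)(t & TYPE_BASE_MASK);
}

/* Return only the bit size (0 for an unsized type). */
static unsigned alu_type_get_type_size(enum alu_type t)
{
    return t & TYPE_SIZE_MASK;
}

/* A sized bool no longer compares equal to the unsized base type, so the
 * check has to go through the base-type helper. */
static bool is_bool_output(enum alu_type t)
{
    return alu_type_get_base_type(t) == TYPE_BOOL;
}
```

With this encoding, `TYPE_BOOL32 == TYPE_BOOL` is false, which is exactly the kind of silent mismatch the slide warns about.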
NIR


* Lots of plumbing to deal correctly with bit-sizes
everywhere.
* Algebraic rules and bit-size validation
* New double-precision opcodes:
  * d2f, d2i, d2u, d2b, f2d, i2d, u2d
  * (un)pack_double_2x32
  * etc.
NIR


* Added lowering for unsupported 64-bit operations on
Intel GPU hardware:
  * trunc, floor, ceil, fract, round, mod, rcp, sqrt, rsq
* nir_lower_doubles(). Implemented in terms of:
  * Supported 64-bit operations
  * 32-bit float/integer math
* Drivers can choose which lowerings to use:
  * nir_lower_doubles_options
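
As an illustration of lowering "in terms of supported 64-bit operations and integer math", the sketch below implements trunc by masking mantissa bits, then builds floor and fract on top of it. It is a CPU model under stated assumptions (hypothetical function names, IEEE 754 doubles) and is not taken from nir_lower_doubles().

```c
#include <stdint.h>
#include <string.h>

/* trunc(x) using only integer operations on the IEEE 754 bit pattern. */
static double lower_trunc(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);

    /* Unbiased exponent: number of mantissa bits that are integral. */
    int exp = (int)((bits >> 52) & 0x7ff) - 1023;

    if (exp < 0) {
        /* |x| < 1: the result is zero with the sign of x. */
        bits &= 0x8000000000000000ull;
    } else if (exp < 52) {
        /* Clear the (52 - exp) fractional mantissa bits. */
        uint64_t frac_mask = (1ull << (52 - exp)) - 1;
        bits &= ~frac_mask;
    }
    /* exp >= 52: already an integer, or Inf/NaN; leave unchanged. */

    memcpy(&x, &bits, sizeof x);
    return x;
}

/* floor(x) = trunc(x), minus one when truncation rounded up (x negative
 * and not an integer). */
static double lower_floor(double x)
{
    double t = lower_trunc(x);
    return (x < t) ? t - 1.0 : t;
}

/* fract(x) = x - floor(x). */
static double lower_fract(double x)
{
    return x - lower_floor(x);
}
```

The same layering idea lets a driver enable only the lowerings it needs: an operation that the hardware supports natively is simply left alone.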
i965 - Fp64
(Align1)


Align1 - SIMD8 - 32bit

[Figure: register layout with channel columns c0..c7 and register rows g0, g1 and g2; g0 and g1 are labeled Row1 and Row2]

mov(8) { align1 1Q }
Align1 - SIMD8 - 64bit

[Figure: 64-bit register layout with channel columns c0..c7; a 192B offset is labeled]
Align1 - SIMD8 - 64bit
(Haswell+)

[Figure: register layout with channel columns c0..c7]
Align1 - SIMD8 - 64bit
(Haswell+)

[Figure: register layout with channel columns c0..c7; rows at offsets 64B and 96B are labeled]

mov(8) { align1 1Q }

* Use the subscript() helper:
subscript(reg, BRW_REGISTER_TYPE_UD, 1)
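
What the subscript() call above selects can be modeled on the CPU: viewing each 64-bit channel as two 32-bit elements and picking element 1 yields the high 32 bits of every channel. The sketch below is illustrative only; read_subscript_ud() is a hypothetical name, not the driver helper.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of reading subscript(reg, UD, i) for one channel:
 * 'reg' is an array of 64-bit channels, 'i' selects the 32-bit element
 * inside the channel (0 = low half, 1 = high half). */
static uint32_t read_subscript_ud(const double *reg, unsigned channel,
                                  unsigned i)
{
    uint64_t bits;
    memcpy(&bits, &reg[channel], sizeof bits);
    return (uint32_t)(bits >> (32 * (i & 1)));
}
```

Consecutive channels of the resulting 32-bit view are 8 bytes apart, which is why code operating on these halves sees a horizontal stride of 2 in 32-bit units.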
i965 - Fp64
(Align16)


Align16 - SIMD4x2 - 32bit

[Figure: register layout with channel columns c0..c7; components X Y Z W repeated in each half of a register; the 0B and 16B boundaries are marked "Fixed"]

{ align16 1Q }
Align16 - 64bit


* Broadwell+ used to need Align16 for Geometry and
Tessellation shaders.
* Nowadays these platforms are fully scalar and Align1 support is
sufficient to expose Fp64
* Older platforms (Haswell, IvyBridge, etc.) still need Align16 for
Vertex, Geometry and Tessellation shaders.
Align16 - 64bit


* Swizzle channels are 32-bit, even on 64-bit operands
  * We can only address DF components XY directly!
* Writemasks are 64-bit for DF destinations though
  * WRITEMASK_XY and WRITEMASK_ZW are 32-bit though -> no
native representation
Align16 - SIMD4x2 - 64bit

[Figure: register layout with channel columns c0..c7; Vertex 1 at 0B (g0) and Vertex 2 at 32B (g1), separated by a thread boundary marked "Fixed"]

{ align16 1Q }
Align16 - SIMD4x2 - 64bit

[Figure: register layout with channel columns c0..c7]

{ align16 1Q }

* Align16 requires 16B alignment
-> 2 DF components in each row.
* Vstride=2 to cover the entire region
Align16 - SIMD4x2 - 64bit

[Figure: register layout with channel columns c0..c7]

{ align16 1Q }

* Each 16B region applies the 4-component 32-bit swizzle
Align16 - SIMD4x2 - 64bit

[Figure: register layout with channel columns c0..c7; rows up to 64B (g2) are labeled]

{ align16 1Q }

* We can't do all swizzle combinations:
  * XXXX, YZYZ, XYYZ, etc. are not supported.
* We need to translate our 64-bit swizzles to 32-bit.
  - X -> XY, Y -> ZW, Z -> ?, W -> ?
Align16 - SIMD4x2 - 64bit
XY | ZW component splitting

[Figure: register layout with channel columns c0..c7 and a thread boundary; rows at 0B, 32B and 64B (g2) are labeled; one instruction is split into two]

g2<1>.xywDF g0<2,2,1>.zwyDF

g2<1>.xyDF g0+1<2,2,1>.xyzwDF
g2+1<1>.yDF g0<2,2,1>.zwzwDF
Align16 - SIMD4x2 - 64bit
XY | ZW component splitting

[Figure: register layout with channel columns c0..c7 and thread boundaries; the 32B row is marked "Wrong!"; an emask row shows channels that are not used]

mov(4) { align16 1Q }
Align16 - SIMD4x2 - 64bit

[Figure: register layout with channel columns c0..c7]

{ align16 1Q }

* Back to square one...
* Z -> XY, W -> ZW (at 16B offset)
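
The 64-bit to 32-bit swizzle translation described on these slides can be written down as a small table: X and Y map to the 32-bit pairs XY and ZW of the first 16B, and Z and W map to the same pairs at a 16B offset. The C sketch below uses hypothetical names and only models the mapping, not the code generation.

```c
/* Component indices shared by 64-bit and 32-bit swizzles. */
enum { SWZ_X = 0, SWZ_Y = 1, SWZ_Z = 2, SWZ_W = 3 };

/* Hypothetical result type: the two 32-bit swizzle channels that make up
 * one DF component, plus the byte offset of the 16B region holding them. */
struct swizzle32 {
    unsigned lo;          /* 32-bit channel with the low half */
    unsigned hi;          /* 32-bit channel with the high half */
    unsigned byte_offset; /* 0 or 16 */
};

/* X -> XY, Y -> ZW, Z -> XY (+16B), W -> ZW (+16B). */
static struct swizzle32 translate_df_component(unsigned comp)
{
    struct swizzle32 r;
    r.lo = (comp & 1) ? SWZ_Z : SWZ_X;
    r.hi = (comp & 1) ? SWZ_W : SWZ_Y;
    r.byte_offset = (comp >= SWZ_Z) ? 16 : 0;
    return r;
}
```

The byte offset is what makes Z and W awkward: a single 32-bit swizzle cannot express it, so reaching those components needs a different region or a split instruction.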
Align16 - SIMD4x2 - 64bit

[Figure: register layout with channel columns c0..c7]

{ align16 1Q }

* Not good enough (and violates register region restrictions)
* We could use a combination of vstride=0 and SIMD splitting.
Align16 - SIMD4x2 - 64bit


* Gen7 hardware seems to have an interesting bug (or feature):
  * The second half of a compressed instruction with vstride=0 will ignore
the vstride and offset exactly 1 register
* We can use this to avoid the SIMD splitting
Align16 - SIMD4x2 - 64bit

[Figure: register layout with channel columns c0..c4 and rows at offsets 0B, 32B, 64B and 96B]

{ align16 1Q }
Align16 - SIMD4x2 - 64bit

[Figure: register layout with channel columns c0..c7]

{ align16 1Q }

* Remember that issue with 32-bit writemasks?
  * WRITEMASK_XY == WRITEMASK_X
Align16 - SIMD4x2 - 64bit

[Figure: register layout with channel columns c0..c7 and rows at offsets 0B, 32B, 64B and 96B; component Z is labeled in the 0B row]

{ align16 1Q }

{ align16 1Q }
Align16 - SIMD4x2 - 64bit


* This is just an example: 
* Different swizzle combinations may require different implementations 
* 2-src instructions and 3-src instructions 
Align16 - SIMD4x2 - 64bit


* Implementation:


Step 1: scalarize everything, swizzle translation at codegen - Done


Step 2: let through swizzle classes that we can support natively (e.g.
XYZW) - Done


Step 3: let through swizzle classes that we can support by exploiting
the vstride=0 behavior (e.g. XXXX) - Done


Step 4: use component splitting (partial scalarization) to support
more swizzle classes - Not Done (yet)
i965 - Fp64
Common Issues


Multiple hardware generations


* Significant differences between IvyBridge, Haswell and
Broadwell+ hardware
* Skylake did not require specific adaptations
* Broxton, CherryView and Braswell only required minor tweaks:
  - 32b to 64b conversions need 64b aligned source data
  - 64b indirect addressing not supported
32-bit driver


* Before fp64 all GLSL types were implemented as 32-bit types.
* Driver code assumed 32-bit types (and even hstride=1) in lots of
places.
* Many fixes like:
- int dst_width = inst->exec_size / 8;
+ int dst_width =
    DIV_ROUND_UP(inst->dst.component_size(inst->exec_size), REG_SIZE);
* This could happen anywhere in the driver
* Piglit was the driving force to find these
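
The effect of the dst_width fix can be checked with a little arithmetic. The sketch below is a simplified model, assuming a 32-byte register and a component size of exec_size * stride * type size; component_size() here is a hypothetical stand-in, not the driver method.

```c
/* Assumption: one GRF register is 32 bytes (8 channels of 32 bits). */
#define REG_SIZE 32u
#define DIV_ROUND_UP(a, b) (((a) + (b) - 1) / (b))

/* Old logic: assumes 32-bit channels with hstride=1. */
static unsigned old_dst_width(unsigned exec_size)
{
    return exec_size / 8;
}

/* Hypothetical simplified model of the bytes covered by one component. */
static unsigned component_size(unsigned exec_size, unsigned type_size,
                               unsigned stride)
{
    unsigned elems = exec_size * stride;
    return (elems ? elems : 1) * type_size;
}

/* New logic: registers written, derived from the real size in bytes. */
static unsigned new_dst_width(unsigned exec_size, unsigned type_size,
                              unsigned stride)
{
    return DIV_ROUND_UP(component_size(exec_size, type_size, stride),
                        REG_SIZE);
}
```

For a SIMD8 instruction the old formula always yields 1 register, while a 64-bit destination (type size 8) actually spans 2, as does a 32-bit destination written with stride 2.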
Unfamiliar code patterns


* Fp64 operation produces new code patterns:
  * 32-bit access patterns on low/high 32-bit chunks of 64-bit data
  * Horizontal strides != 1
* Some parts of the driver did not handle these scenarios
properly.
  - Copy-propagation received at least 7 patches!
32-bit read/write messages


* All read/write messages are 32-bit
  * Pull loads, UBOs, SSBOs, URB, scratch...
* 64-bit data needs to be shuffled into 32-bit channels before
writing
* 32-bit data reads need to be shuffled into valid 64-bit data
channels
  * shuffle_64bit_data_for_32bit_write()
  * shuffle_32bit_load_result_to_64bit_data()
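
The shuffling described above can be modeled as an interleave. In the sketch below, a SIMD8 message delivers the low and high 32-bit halves as two separate 8-channel registers, and the shuffle combines them into eight 64-bit channels (and back for writes). The function names and the array-based layout are assumptions for illustration, not the driver helpers.

```c
#include <stdint.h>

#define SIMD_WIDTH 8

/* After a 32-bit read: lo[c] and hi[c] hold the two halves of channel c in
 * separate registers. Interleave them into valid 64-bit channels. */
static void shuffle_32bit_to_64bit(const uint32_t lo[SIMD_WIDTH],
                                   const uint32_t hi[SIMD_WIDTH],
                                   uint64_t out[SIMD_WIDTH])
{
    for (unsigned c = 0; c < SIMD_WIDTH; c++)
        out[c] = ((uint64_t)hi[c] << 32) | lo[c];
}

/* Before a 32-bit write: split each 64-bit channel into the two 32-bit
 * registers the message expects. */
static void shuffle_64bit_to_32bit(const uint64_t in[SIMD_WIDTH],
                                   uint32_t lo[SIMD_WIDTH],
                                   uint32_t hi[SIMD_WIDTH])
{
    for (unsigned c = 0; c < SIMD_WIDTH; c++) {
        lo[c] = (uint32_t)in[c];
        hi[c] = (uint32_t)(in[c] >> 32);
    }
}
```

The two functions are inverses of each other, so a write followed by a read of the same data round-trips unchanged.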
32-bit read/write messages
(SIMD8 read)

[Figure: registers at offsets 0B (g0), 32B, 64B (g2) and 96B with channel columns c0..c7; one component per register]

* 4x4B = 16B
* Just what we want for 32-bit scalar operation
* 8x16B SIMD8 read messages
* Separate variables (registers) for each component
* Consecutive components in consecutive registers
32-bit read/write messages
(SIMD8 read)

[Figure: channel columns c0..c7; results of read(+0B) and read(+16B) placed in consecutive registers (g2, g3 and g4 are labeled); annotated "Invalid 64b data"]
32-bit read/write messages
(SIMD8 read)

[Figure: channel columns c0..c7; registers g0..g7 at offsets 0B to 224B, starting with the result of read(+0B)]
64-bit immediates (gen7)


* No support for 64-bit immediates (Haswell, IvyBridge)
* Haswell provides the DIM instruction specifically for this purpose, we
just had to add support for this in the driver.
* IvyBridge requires that we emit code to load each 32-bit chunk of the
constant into a register and then return either a XXXX swizzle
(align16) or a stride 0 (align1).
Bugs / restrictions (Align1)


* Second half of compressed instructions that don't write all
channels have wrong emask (Haswell)
  * Requires SIMD splitting
* Second half of compressed 64-bit instructions has wrong
emask (IvyBridge)
  * Requires unconditional SIMD splitting of all 64-bit instructions :-(
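
SIMD splitting, as used in the workarounds above, means replacing one wide instruction with several narrower ones that each cover a group of channels. The sketch below models only the bookkeeping; struct simd_inst and split_simd() are hypothetical, and it assumes power-of-two execution sizes.

```c
/* Hypothetical instruction descriptor: how many channels it executes and
 * the index of the first channel it covers. */
struct simd_inst {
    unsigned exec_size;
    unsigned group;
};

/* Split 'in' into pieces of at most 'max_width' channels. Returns the
 * number of pieces written to 'out', or 0 if they do not fit or the
 * arguments are invalid. */
static unsigned split_simd(struct simd_inst in, unsigned max_width,
                           struct simd_inst *out, unsigned max_out)
{
    if (max_width == 0 || in.exec_size == 0)
        return 0;

    unsigned width = in.exec_size < max_width ? in.exec_size : max_width;
    unsigned count = in.exec_size / width;
    if (count > max_out)
        return 0;

    for (unsigned i = 0; i < count; i++) {
        out[i].exec_size = width;
        out[i].group = in.group + i * width; /* first channel of piece i */
    }
    return count;
}
```

Splitting a SIMD8 instruction at width 4 gives two SIMD4 instructions covering channels 0..3 and 4..7, so neither piece is compressed and the second-half emask problem does not arise.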
Bugs / restrictions (Align16)


* Vertical stride 0 doesn't work (gen7)
  * The second half of a compressed instruction will invariably offset a full register
  * Requires SIMD splitting
* Instructions that write 2 registers must also read 2 registers (gen7)
  * This was a known issue in gen7, but it was never triggered in Align16 before
Fp64
  * Requires SIMD splitting


A simple test that just copies a DF uniform to the output hits both of
these bugs! :-(
Bugs / restrictions (Align16)


* 3-src instructions can't use RepCtrl=1
  * Not supported for 64-bit instructions
  * Only affects MAD
  * RepCtrl=0 leads to <4,4,1>:DF regions so it can only be used to work
with components XY
    - Requires temporaries and component splitting, but leads to quite bad code in
general
    - Avoiding MAD altogether seems a better option for now
Bugs / restrictions (Align16)


* Compressed bcsel
  * Does not read the predication mask properly
  * Requires SIMD splitting
* Dependency control
  * Can't be used with 64-bit instructions -> GPU hangs
Current State


Status


* Skylake: Available in Mesa 12.0
* Broadwell: Available in Mesa 12.0
* Haswell:
  * ARB_gpu_shader_fp64: Implemented, in review
  * ARB_vertex_attrib_64bit: Implemented
* IvyBridge:
  * ARB_gpu_shader_fp64: Align1 implemented, Align16 implementation
in progress
  * ARB_vertex_attrib_64bit: Not started
Questions?


